First we’ll load the necessary libraries and data.
library(arules)
library(dplyr)
library(arulesViz)
options(scipen = 999)
data(Boston, package = "MASS")
We can look at the class of each variable within the Boston data:
lapply(Boston, class)
## $crim
## [1] "numeric"
##
## $zn
## [1] "numeric"
##
## $indus
## [1] "numeric"
##
## $chas
## [1] "integer"
##
## $nox
## [1] "numeric"
##
## $rm
## [1] "numeric"
##
## $age
## [1] "numeric"
##
## $dis
## [1] "numeric"
##
## $rad
## [1] "integer"
##
## $tax
## [1] "numeric"
##
## $ptratio
## [1] "numeric"
##
## $black
## [1] "numeric"
##
## $lstat
## [1] "numeric"
##
## $medv
## [1] "numeric"
Most variables are numeric, except for chas and rad, which are integers; we’ll review their values and convert them to factors.
unique(Boston$chas)
## [1] 0 1
unique(Boston$rad)
## [1] 1 2 3 5 4 8 6 7 24
We’ll create a new dataset with the new variable class.
b <- Boston
b$chas <- factor(Boston$chas, labels = c("river", "noriver"))
b$rad <- factor(Boston$rad)
The variable b$black will be cut into bins for better interpretation.
b$black <- cut(Boston$black, breaks = 4, labels = c(">31.5%", "18.5-31.5%", "8-18.5%", "<8%"))
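For intuition, here’s a minimal sketch of how cut() builds equal-width bins, using a small hypothetical vector rather than the Boston data: the range is split into intervals of equal width, and labels are assigned in order from lowest to highest.

```r
# Equal-width binning with cut(): the range 2..20 is split into 4 bins
# of width 4.5 each, so the break points fall near 2, 6.5, 11, 15.5, 20.
x <- c(2, 5, 9, 14, 20)  # toy data for illustration
bins <- cut(x, breaks = 4, labels = c("low", "medlow", "medhigh", "High"))
as.character(bins)
# "low" "low" "medlow" "medhigh" "High"
```

Note that cut() extends the range slightly on both ends, so the minimum and maximum values still land inside the first and last bins.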
Now we can discretize all the remaining variables in dataset b by putting them into 4 equal-width bins. We first set aside chas, rad, and black, apply the binning to the other variables, and then bind those three factor columns back onto the dataset. Last, we’ll turn the dataset b into a transactional dataset.
discrt <- function(x) cut(x, breaks = 4, labels = c("low", "medlow", "medhigh", "High"))
b <- select(b, -c("chas", "rad", "black")) %>%
  mutate_all(funs(discrt)) %>%
  bind_cols(select(b, c("chas", "rad", "black")))
dim(b)
## [1] 506 14
summary(b)
## crim zn indus nox rm
## low :491 low :429 low :202 low :200 low : 8
## medlow : 10 medlow : 32 medlow :112 medlow :182 medlow :234
## medhigh: 2 medhigh: 16 medhigh:165 medhigh:100 medhigh:236
## High : 3 High : 29 High : 27 High : 24 High : 28
##
##
##
## age dis tax ptratio lstat
## low : 51 low :305 low :240 low : 58 low :243
## medlow : 97 medlow :144 medlow :128 medlow : 68 medlow :187
## medhigh: 96 medhigh: 52 medhigh: 1 medhigh:171 medhigh: 57
## High :262 High : 5 High :137 High :209 High : 19
##
##
##
## medv chas rad black
## low :116 river :471 24 :132 >31.5% : 31
## medlow :284 noriver: 35 5 :115 18.5-31.5%: 8
## medhigh: 74 4 :110 8-18.5% : 15
## High : 32 3 : 38 <8% :452
## 6 : 26
## 2 : 24
## (Other): 61
b <- as(b, "transactions")
Check the columns to make sure they’re all in the bins:
colnames(b)
## [1] "crim=low" "crim=medlow" "crim=medhigh"
## [4] "crim=High" "zn=low" "zn=medlow"
## [7] "zn=medhigh" "zn=High" "indus=low"
## [10] "indus=medlow" "indus=medhigh" "indus=High"
## [13] "nox=low" "nox=medlow" "nox=medhigh"
## [16] "nox=High" "rm=low" "rm=medlow"
## [19] "rm=medhigh" "rm=High" "age=low"
## [22] "age=medlow" "age=medhigh" "age=High"
## [25] "dis=low" "dis=medlow" "dis=medhigh"
## [28] "dis=High" "tax=low" "tax=medlow"
## [31] "tax=medhigh" "tax=High" "ptratio=low"
## [34] "ptratio=medlow" "ptratio=medhigh" "ptratio=High"
## [37] "lstat=low" "lstat=medlow" "lstat=medhigh"
## [40] "lstat=High" "medv=low" "medv=medlow"
## [43] "medv=medhigh" "medv=High" "chas=river"
## [46] "chas=noriver" "rad=1" "rad=2"
## [49] "rad=3" "rad=4" "rad=5"
## [52] "rad=6" "rad=7" "rad=8"
## [55] "rad=24" "black=>31.5%" "black=18.5-31.5%"
## [58] "black=8-18.5%" "black=<8%"
Get a summary of the newly discretized and cut dataset.
summary(b)
## transactions as itemMatrix in sparse format with
## 506 rows (elements/itemsets/transactions) and
## 59 columns (items) and a density of 0.2372881
##
## most frequent items:
## crim=low chas=river black=<8% zn=low dis=low (Other)
## 491 471 452 429 305 4936
##
## element (itemset/transaction) length distribution:
## sizes
## 14
## 506
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14 14 14 14 14 14
##
## includes extended item information - examples:
## labels variables levels
## 1 crim=low crim low
## 2 crim=medlow crim medlow
## 3 crim=medhigh crim medhigh
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
Now we’ll inspect the first 3 transactions.
inspect(b[1:3])
## items transactionID
## [1] {crim=low,
## zn=low,
## indus=low,
## nox=medlow,
## rm=medhigh,
## age=medhigh,
## dis=medlow,
## tax=low,
## ptratio=medlow,
## lstat=low,
## medv=medlow,
## chas=river,
## rad=1,
## black=<8%} 1
## [2] {crim=low,
## zn=low,
## indus=low,
## nox=low,
## rm=medhigh,
## age=High,
## dis=medlow,
## tax=low,
## ptratio=medhigh,
## lstat=low,
## medv=medlow,
## chas=river,
## rad=2,
## black=<8%} 2
## [3] {crim=low,
## zn=low,
## indus=low,
## nox=low,
## rm=medhigh,
## age=medhigh,
## dis=medlow,
## tax=low,
## ptratio=medhigh,
## lstat=low,
## medv=medhigh,
## chas=river,
## rad=2,
## black=<8%} 3
We can plot the frequency of items with support of at least 0.3, as seen below:
itemFrequencyPlot(b, support=.3, cex.names=.8)
Now we’ll apply the apriori algorithm to the b dataset with 2.5% minimum support and 75% minimum confidence. With 506 transactions this gives an absolute minimum support count of 12, and the algorithm checks itemsets up to the default maximum rule length of 10.
ars <- apriori(b, parameter = list(support=.025, confidence=.75))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.025 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 12
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[59 item(s), 506 transaction(s)] done [0.00s].
## sorting and recoding items ... [52 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.03s].
## writing ... [408638 rule(s)] done [0.10s].
## creating S4 object ... done [0.17s].
We can get a summary of the rules that includes the distribution of rule lengths (lhs + rhs items) and the quality measures of the rules satisfying our support and confidence constraints.
summary(ars)
## set of 408638 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8 9 10
## 4 293 3650 18932 53620 92554 103550 78411 41677 15947
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 7.000 6.846 8.000 10.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.02569 Min. :0.7500 Min. : 0.7799 Min. : 13.00
## 1st Qu.:0.02964 1st Qu.:0.9189 1st Qu.: 1.0743 1st Qu.: 15.00
## Median :0.03755 Median :1.0000 Median : 1.6590 Median : 19.00
## Mean :0.04857 Mean :0.9517 Mean : 1.9759 Mean : 24.58
## 3rd Qu.:0.05534 3rd Qu.:1.0000 3rd Qu.: 2.4211 3rd Qu.: 28.00
## Max. :0.97036 Max. :1.0000 Max. :19.4615 Max. :491.00
##
## mining info:
## data ntransactions support confidence
## b 506 0.025 0.75
As the example notes, we are interested in the association between pollution (nox) and property value (medv), so we’ll find the top 5 rules by confidence with "medv=High", and then "medv=low", on the rhs:
inspect(head(subset(ars, subset=rhs %in% "medv=High"), 5, by="confidence"))
## lhs rhs support confidence lift count
## [1] {rm=High,
## ptratio=low} => {medv=High} 0.02964427 1 15.8125 15
## [2] {rm=High,
## ptratio=low,
## lstat=low} => {medv=High} 0.02964427 1 15.8125 15
## [3] {rm=High,
## ptratio=low,
## black=<8%} => {medv=High} 0.02964427 1 15.8125 15
## [4] {crim=low,
## rm=High,
## ptratio=low} => {medv=High} 0.02964427 1 15.8125 15
## [5] {rm=High,
## ptratio=low,
## lstat=low,
## black=<8%} => {medv=High} 0.02964427 1 15.8125 15
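As a sanity check, lift is confidence divided by the baseline frequency of the rhs. Since medv=High occurs in 32 of the 506 tracts (per the earlier summary), any confidence-1 rule predicting it gets exactly the lift of 15.8125 shown above; a minimal sketch of the arithmetic:

```r
n_trans    <- 506   # total transactions
rhs_count  <- 32    # tracts with medv=High, from summary(b)
confidence <- 1     # all five rules above have confidence 1

# lift = confidence / P(rhs)
lift <- confidence / (rhs_count / n_trans)
lift                # 15.8125

# support = rule count / total transactions
support <- 15 / n_trans
support             # 0.02964427
```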
And here are the top 5 rules by confidence with medv=low on the rhs:
inspect(head(subset(ars, subset=rhs %in% "medv=low"), 5, by="confidence"))
## lhs rhs support confidence lift count
## [1] {nox=medhigh,
## lstat=medhigh} => {medv=low} 0.05928854 1 4.362069 30
## [2] {nox=medhigh,
## lstat=medhigh,
## rad=24} => {medv=low} 0.05928854 1 4.362069 30
## [3] {nox=medhigh,
## tax=High,
## lstat=medhigh} => {medv=low} 0.05928854 1 4.362069 30
## [4] {indus=medhigh,
## nox=medhigh,
## lstat=medhigh} => {medv=low} 0.05928854 1 4.362069 30
## [5] {nox=medhigh,
## ptratio=High,
## lstat=medhigh} => {medv=low} 0.05928854 1 4.362069 30
And now we’ll look at rules with high pollution (nox=High) on either the lhs or the rhs:
inspect(head(subset(ars, subset=rhs %in% "nox=High" | lhs %in% "nox=High")))
## lhs rhs support confidence lift count
## [1] {nox=High} => {indus=medhigh} 0.04743083 1.0000000 3.0666667 24
## [2] {nox=High} => {rm=medlow} 0.03754941 0.7916667 1.7118946 19
## [3] {nox=High} => {age=High} 0.04743083 1.0000000 1.9312977 24
## [4] {nox=High} => {dis=low} 0.04743083 1.0000000 1.6590164 24
## [5] {nox=High} => {zn=low} 0.04743083 1.0000000 1.1794872 24
## [6] {nox=High} => {black=<8%} 0.03754941 0.7916667 0.8862463 19
Instead of ranking the medv=High rules by confidence, we’ll rank them by support:
inspect(head(subset(ars, subset=rhs %in% "medv=High"), 5, by="support"))
## lhs rhs support confidence
## [1] {rm=High} => {medv=High} 0.04743083 0.8571429
## [2] {rm=High,lstat=low} => {medv=High} 0.04743083 0.8571429
## [3] {rm=High,black=<8%} => {medv=High} 0.04743083 0.8571429
## [4] {crim=low,rm=High} => {medv=High} 0.04743083 0.8571429
## [5] {rm=High,lstat=low,black=<8%} => {medv=High} 0.04743083 0.8571429
## lift count
## [1] 13.55357 24
## [2] 13.55357 24
## [3] 13.55357 24
## [4] 13.55357 24
## [5] 13.55357 24
Now we’ll look at rules generated from maximal and closed itemsets:
We’ll start with rules whose itemsets are maximal; they sit at our minimum support (a count of 13) with confidence well above the constraint.
inspect(head(subset(ars, subset = is.maximal(ars)), by = "confidence"))
## lhs rhs support
## [1] {zn=low,lstat=medlow,chas=noriver} => {crim=low} 0.0256917
## [2] {crim=low,lstat=medlow,chas=noriver} => {zn=low} 0.0256917
## [3] {rm=medhigh,chas=noriver,black=<8%} => {crim=low} 0.0256917
## [4] {crim=low,rm=medhigh,chas=noriver} => {black=<8%} 0.0256917
## [5] {rm=medlow,ptratio=low,medv=medlow} => {crim=low} 0.0256917
## [6] {indus=medhigh,age=medhigh,chas=river} => {crim=low} 0.0256917
## confidence lift count
## [1] 1 1.030550 13
## [2] 1 1.179487 13
## [3] 1 1.030550 13
## [4] 1 1.119469 13
## [5] 1 1.030550 13
## [6] 1 1.030550 13
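To clarify the distinction: a frequent itemset is maximal if none of its supersets is frequent, and closed if no superset has the same support. A minimal sketch on toy transactions (hypothetical items a, b, c; not from the Boston data) makes this concrete:

```r
library(arules)

# Four toy baskets: {a,b}, {a,b}, {a,b,c}, {a,c}
trans <- as(list(c("a", "b"), c("a", "b"), c("a", "b", "c"), c("a", "c")),
            "transactions")

# Mine frequent itemsets at 50% support (absolute count of 2)
fi <- apriori(trans,
              parameter = list(target = "frequent itemsets", support = 0.5),
              control = list(verbose = FALSE))

# Frequent: {a},{b},{c},{a,b},{a,c}.  {b} and {c} are not closed because
# {a,b} and {a,c} have the same support; only {a,b} and {a,c} are maximal.
inspect(fi[is.closed(fi)])
inspect(fi[is.maximal(fi)])
```

Every maximal itemset is closed, but not vice versa, which is why the closed set below is much larger than the maximal set.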
To find the closed itemsets we first need to mine the frequent itemsets; 52 items meet the support threshold:
freq.itemsets <- apriori(b, parameter = list(target="frequent itemsets", support=.025))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.025 1
## maxlen target ext
## 10 frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 12
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[59 item(s), 506 transaction(s)] done [0.00s].
## sorting and recoding items ... [52 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.11s].
## writing ... [106259 set(s)] done [0.03s].
## creating S4 object ... done [0.09s].
Now we can find and review the closed itemsets:
inspect(head(subset(ars, subset = is.closed(freq.itemsets)), by = "confidence"))
## lhs rhs support confidence lift count
## [1] {rad=1} => {black=<8%} 0.03952569 1.0000000 1.1194690 20
## [2] {nox=High} => {indus=medhigh} 0.04743083 1.0000000 3.0666667 24
## [3] {nox=High} => {dis=low} 0.04743083 1.0000000 1.6590164 24
## [4] {nox=High} => {zn=low} 0.04743083 1.0000000 1.1794872 24
## [5] {nox=High} => {black=<8%} 0.03754941 0.7916667 0.8862463 19
## [6] {rad=2} => {ptratio=medhigh} 0.03557312 0.7500000 2.2192982 18
closed = freq.itemsets[is.closed(freq.itemsets)]
summary(closed)
## set of 11351 itemsets
##
## most frequent items:
## crim=low black=<8% zn=low chas=river dis=low (Other)
## 9822 7839 7794 6833 4237 44615
##
## element (itemset/transaction) length distribution:sizes
## 1 2 3 4 5 6 7 8 9 10
## 13 72 239 632 1309 2059 2346 1782 829 2070
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 7.000 7.148 9.000 10.000
##
## summary of quality measures:
## support count
## Min. :0.02569 Min. : 13.00
## 1st Qu.:0.03360 1st Qu.: 17.00
## Median :0.04743 Median : 24.00
## Mean :0.06915 Mean : 34.99
## 3rd Qu.:0.07708 3rd Qu.: 39.00
## Max. :0.97036 Max. :491.00
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## b 506 0.025 1
We can compare these frequencies to the maximal itemsets, whose sizes start at 4, with 4 itemsets of that smallest size.
maximal = freq.itemsets[is.maximal(freq.itemsets)]
summary(maximal)
## set of 2949 itemsets
##
## most frequent items:
## crim=low chas=river zn=low black=<8% dis=low (Other)
## 2401 2247 2238 2093 1755 16939
##
## element (itemset/transaction) length distribution:sizes
## 4 5 6 7 8 9 10
## 4 21 57 184 295 318 2070
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 9.000 10.000 9.384 10.000 10.000
##
## summary of quality measures:
## support count
## Min. :0.02569 Min. :13.00
## 1st Qu.:0.02569 1st Qu.:13.00
## Median :0.02964 Median :15.00
## Mean :0.03450 Mean :17.46
## 3rd Qu.:0.03755 3rd Qu.:19.00
## Max. :0.14032 Max. :71.00
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## b 506 0.025 1
Now we can look at shorter rules:
inspect(head(subset(ars, subset= size(lhs) <5 & size(lhs) >1), 5, by="support"))
## lhs rhs support confidence lift
## [1] {chas=river,black=<8%} => {crim=low} 0.8083004 0.9784689 1.0083610
## [2] {crim=low,black=<8%} => {chas=river} 0.8083004 0.9232506 0.9918573
## [3] {crim=low,chas=river} => {black=<8%} 0.8083004 0.8969298 1.0040852
## [4] {zn=low,chas=river} => {crim=low} 0.7569170 0.9623116 0.9917101
## [5] {crim=low,zn=low} => {chas=river} 0.7569170 0.9251208 0.9938665
## count
## [1] 409
## [2] 409
## [3] 409
## [4] 383
## [5] 383
And refine the previous query by also requiring a lift above 2.
inspect(head(subset(ars, subset=size(lhs)<5 & size(lhs) >1 & lift >2), 5, by="support"))
## lhs rhs support confidence
## [1] {nox=low,black=<8%} => {indus=low} 0.3221344 0.8150000
## [2] {indus=low,black=<8%} => {nox=low} 0.3221344 0.8069307
## [3] {crim=low,nox=low} => {indus=low} 0.3221344 0.8150000
## [4] {crim=low,indus=low} => {nox=low} 0.3221344 0.8069307
## [5] {crim=low,nox=low,black=<8%} => {indus=low} 0.3221344 0.8150000
## lift count
## [1] 2.041535 163
## [2] 2.041535 163
## [3] 2.041535 163
## [4] 2.041535 163
## [5] 2.041535 163
First we’ll review the interactive scatter plot for all 408638 rules.
plot(ars, engine = "htmlwidget", jitter = 0)
Now we’ll look at a grouped matrix using new constraints, which lets us compare the support and lift of lhs/rhs combinations.
somerules <- subset(ars, subset=size(lhs)>1 & confidence >.9 & support >0.5)
plot(somerules, method = "grouped")
This matrix with the top 21 rules shows the many antecedents found within the 4 consequents {chas=river}, {crim=low}, {black=<8%}, and {zn=low}:
plot(somerules, method = "matrix")
## Itemsets in Antecedent (LHS)
## [1] "{crim=low,dis=low,chas=river}" "{crim=low,dis=low}"
## [3] "{dis=low,chas=river}" "{crim=low,medv=medlow}"
## [5] "{medv=medlow,black=<8%}" "{medv=medlow,chas=river}"
## [7] "{chas=river,black=<8%}" "{zn=low,chas=river,black=<8%}"
## [9] "{zn=low,black=<8%}" "{crim=low,zn=low}"
## [11] "{crim=low,black=<8%}" "{zn=low,chas=river}"
## [13] "{crim=low,zn=low,black=<8%}" "{zn=low,dis=low}"
## [15] "{crim=low,zn=low,dis=low}" "{zn=low,dis=low,chas=river}"
## Itemsets in Consequent (RHS)
## [1] "{chas=river}" "{crim=low}" "{black=<8%}" "{zn=low}"
Use the 4 consequents to create a network graph.
plot(somerules, method = "graph", engine = "htmlwidget")
After going through this exercise, perform association rule learning on your dataset. Turn in both the R code for the exercise, and the R code for the practice using your datasets. You want to explore different thresholds, use the interactive vis tools provided by arulesViz, and find and report at least two interesting association rules from your dataset.
rm(list=ls())
load("LifeExpectancyData_3.Rdata")
data$Status = as.factor(data$Status)
data$Year = as.factor(data$Year)
data$Country = as.factor(data$Country)
data$Life <- cut(data$`Life expectancy`,
breaks = 3,
labels=c('low LE', 'middle LE', 'high LE'))
data$infantmort <- cut(data$`infant deaths`,
breaks = 3,
labels=c('low_infant', 'middle_infant', 'high_infant'))
data$expenditure <- cut(data$`Total expenditure`,
breaks = 3,
labels=c('low_expenditure','med_expenditure', 'high_expenditure'))
data$pop <- cut(data$Population,
breaks = 3,
labels=c('low_pop','med_pop', 'high_pop'))
data$bmi <- cut(data$BMI,
breaks = 3,
labels=c('low_BMI','med_BMI', 'high_BMI'))
data$gdp <- cut(data$GDP,
breaks = 3,
labels=c('low_GDP','med_GDP', 'high_GDP'))
data$alcohol <- cut(data$Alcohol,
breaks = 3,
labels=c('low_alc','med_alc', 'high_alc'))
drops <- c("Alcohol",
"percentage expenditure",
"BMI",
"GDP",
"Diphtheria",
"Life expectancy",
"Measles",
"infant deaths",
"Total expenditure",
"thinness 1-19 years",
"thinness 5-9 years",
"Population",
"HIV/AIDS",
"Polio",
"under-five deaths",
"Adult Mortality",
"Income composition of resources",
"Schooling")
data <- data[ , !(names(data) %in% drops)]
Transform the dataframe to a transactions dataset, where each row is described by a set of binary variables (this is the “bitmap indexing” we learned in Chapter 4 of the textbook). Transactions data are often very large and sparse, so looking at them directly won’t give you much information. You can see how the columns are constructed by using colnames(), or see a summary(). To see the records, use inspect(): inspect(b[1:3]) shows the first 3 transactions.
b <- as(as.data.frame(data), "transactions")
inspect(b[1:3])
## items transactionID
## [1] {Country=Afghanistan,
## Year=2015,
## Status=Developing,
## region=Asia,
## Life=middle LE,
## infantmort=low_infant,
## expenditure=med_expenditure,
## pop=low_pop,
## bmi=low_BMI,
## gdp=low_GDP,
## alcohol=low_alc} 1
## [2] {Country=Afghanistan,
## Year=2014,
## Status=Developing,
## region=Asia,
## Life=middle LE,
## infantmort=low_infant,
## expenditure=med_expenditure,
## pop=low_pop,
## bmi=low_BMI,
## gdp=low_GDP,
## alcohol=low_alc} 2
## [3] {Country=Afghanistan,
## Year=2013,
## Status=Developing,
## region=Asia,
## Life=middle LE,
## infantmort=low_infant,
## expenditure=med_expenditure,
## pop=low_pop,
## bmi=low_BMI,
## gdp=low_GDP,
## alcohol=low_alc} 3
ars <- apriori(b, parameter = list(support=0.30, confidence=0.75))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.3 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 881
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[236 item(s), 2938 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [222 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
#222 rules created
#find rules generated from maximal/closed itemsets:
#maximal itemsets
inspect(head(subset(ars, subset = is.maximal(ars)), by = "support"))
## lhs rhs support confidence lift count
## [1] {Status=Developing,
## bmi=med_BMI} => {infantmort=low_infant} 0.3430905 1.0000000 1.0054757 1008
## [2] {infantmort=low_infant,
## bmi=med_BMI} => {Status=Developing} 0.3430905 0.8083400 0.9789377 1008
## [3] {Status=Developing,
## Life=high LE} => {infantmort=low_infant} 0.3573860 1.0000000 1.0054757 1050
## [4] {Status=Developing,
## bmi=low_BMI,
## alcohol=low_alc} => {infantmort=low_infant} 0.3012253 0.9833333 0.9887178 885
## [5] {infantmort=low_infant,
## bmi=low_BMI,
## alcohol=low_alc} => {Status=Developing} 0.3012253 0.9899329 1.1988552 885
## [6] {Status=Developing,
## infantmort=low_infant,
## bmi=low_BMI} => {alcohol=low_alc} 0.3012253 0.8171745 1.3641243 885
inspect(head(subset(ars, subset=rhs %in% "Life=low LE"), 5, by="support"))
inspect(head(subset(ars, subset=rhs %in% "Life=middle LE"), 5, by="support"))
inspect(head(subset(ars, subset=rhs %in% "Life=high LE"), 5, by="support"))
# no rules associated with these
# need to find freq itemsets to find closed itemsets:
freq.itemsets <- apriori(b, parameter=list(target="frequent itemsets", support=0.25))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.25 1
## maxlen target ext
## 10 frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 734
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[236 item(s), 2938 transaction(s)] done [0.00s].
## sorting and recoding items ... [14 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [163 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
closed = freq.itemsets[is.closed(freq.itemsets)]
inspect(head(subset(freq.itemsets, subset = is.closed(freq.itemsets)), by = "support"))
## items support count
## [1] {region=Asia} 0.2559564 752
## [2] {Life=middle LE} 0.3801906 1117
## [3] {bmi=low_BMI} 0.3982301 1170
## [4] {expenditure=low_expenditure} 0.5125936 1506
## [5] {alcohol=low_alc} 0.5990470 1760
## [6] {pop=low_pop} 0.7763785 2281
#find maximal itemsets
maximal = freq.itemsets[is.maximal(freq.itemsets)]
summary(maximal)
## set of 17 itemsets
##
## most frequent items:
## infantmort=low_infant Status=Developing gdp=low_GDP
## 15 9 9
## pop=low_pop Life=high LE (Other)
## 7 4 20
##
## element (itemset/transaction) length distribution:sizes
## 2 3 4 5 6
## 3 4 5 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.765 5.000 6.000
##
## summary of quality measures:
## support count
## Min. :0.2505 Min. : 736.0
## 1st Qu.:0.2512 1st Qu.: 738.0
## Median :0.2828 Median : 831.0
## Mean :0.2791 Mean : 820.1
## 3rd Qu.:0.2941 3rd Qu.: 864.0
## Max. :0.3485 Max. :1024.0
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## b 2938 0.25 1
#check shorter rules
inspect(head(subset(ars, subset= size(lhs)<5 & size(lhs) >1), 5, by="support"))
## lhs rhs support confidence lift count
## [1] {pop=low_pop,
## gdp=low_GDP} => {infantmort=low_infant} 0.7307692 0.9949027 1.000350 2147
## [2] {infantmort=low_infant,
## pop=low_pop} => {gdp=low_GDP} 0.7307692 0.9458150 1.176961 2147
## [3] {infantmort=low_infant,
## gdp=low_GDP} => {pop=low_pop} 0.7307692 0.9155650 1.179277 2147
## [4] {Status=Developing,
## gdp=low_GDP} => {infantmort=low_infant} 0.6773315 0.9920239 0.997456 1990
## [5] {Status=Developing,
## infantmort=low_infant} => {gdp=low_GDP} 0.6773315 0.8257261 1.027524 1990
#note the above rules have high support and confidence but low lift.
inspect(head(subset(ars, subset= size(lhs)<5 & size(lhs) >1 & lift > 1), 5, by="support"))
## lhs rhs support confidence lift count
## [1] {pop=low_pop,
## gdp=low_GDP} => {infantmort=low_infant} 0.7307692 0.9949027 1.000350 2147
## [2] {infantmort=low_infant,
## pop=low_pop} => {gdp=low_GDP} 0.7307692 0.9458150 1.176961 2147
## [3] {infantmort=low_infant,
## gdp=low_GDP} => {pop=low_pop} 0.7307692 0.9155650 1.179277 2147
## [4] {Status=Developing,
## infantmort=low_infant} => {gdp=low_GDP} 0.6773315 0.8257261 1.027524 1990
## [5] {infantmort=low_infant,
## gdp=low_GDP} => {Status=Developing} 0.6773315 0.8486141 1.027712 1990
Plotting rules by confidence and support
plot(ars, engine = "htmlwidget", jitter = 0)
#grouped
somerules <- subset(ars, subset=size(lhs) > 1 & confidence>0.90 & support>0.3)
plot(somerules, method="grouped")
plot(somerules, method="graph", engine="htmlwidget")
The association rules produced from our dataset have quite low lift, meaning the rules are relatively unimportant: where the Boston dataset had lifts of over 2, our own dataset shows a maximum lift of only around 1.2.
In addition, our main variable of interest, life expectancy, has very few rules with higher support. When it does show up (with low support), it is associated with low infant mortality, low GDP, and developing countries. However, the stronger rules below are more associated with infant mortality rates than life expectancy.
Rule 1 {Status=Developing, infantmort=low_infant, expenditure=low_expenditure} => {alcohol=low_alc}
support: 0.385
confidence: 0.811
lift: 1.35
Rule 2 {Status=Developing, Life=middle LE, infantmort=low_infant, gdp=low_GDP} => {pop=low_pop}
support: 0.307
confidence: 0.991
lift: 1.28
Rule 3 {Status=Developing, infantmort=low_infant, bmi=low_BMI} => {pop=low_pop}
support: 0.302
confidence: 0.818
lift: 1.05